The sections below give objective criteria for evaluating the usability of machine translation software output.
Contents |
Do repeated translations converge on a single expression in both languages? I.e. does the translation method show stationarity or produce a canonical form. Does the translation become stationary without losing the original meaning? This metric has been criticized as not being well correlated with BLEU (BiLingual Evaluation Understudy) scores[1]
Is the system adaptive to colloquialism, argot or slang? The French language has many rules for creating words in the speech and writing of popular culture. Two such rules are: (a) The reverse spelling of words such as femme to meuf. (This is called verlan.) (b) The attachment of the suffix -ard to a noun or verb to form a proper noun. For example, the noun faluche means "student hat". The word faluchard formed from faluche colloquially can mean, depending on context, "a group of students", "a gathering of students" and "behavior typical of a student". The Google translator as of 28 December 2006 doesn't derive the constructed words as for example from rule (b), as shown here:
Il y a une chorale falucharde mercredi, venez nombreux, les faluchards chantent des paillardes! ==> There is a choral society falucharde Wednesday, come many, the faluchards sing loose-living women!
French argot has three levels of usage:[2]
The United States National Institute of Standards and Technology conducts annual evaluations[1] of machine translation systems based on the BLEU-4 criterion [2]. A combined method called IQmt which incorporates BLEU and additional metrics NIST, GTM, ROUGE and METEOR has been implemeneted by Gimenez and Amigo [3].
Is the output grammatical or well-formed in the target language? Using an interlingua should be helpful in this regard, because with a fixed interlingua one should be able to write a grammatical mapping to the target language from the interlingua. Consider the following Arabic language input and English language translation result from the Google translator as of 27 December 2006 [4]. This Google translator output doesn't parse using a reasonable English grammar:
وعن حوادث التدافع عند شعيرة رمي الجمرات -التي كثيرا ما يسقط فيها العديد من الضحايا- أشار الأمير نايف إلى إدخال "تحسينات كثيرة في جسر الجمرات ستمنع بإذن الله حدوث أي تزاحم". ==> And incidents at the push Carbuncles-throwing ritual, which often fall where many of the victims - Prince Nayef pointed to the introduction of "many improvements in bridge Carbuncles God would stop the occurrence of any competing."
Do repeated re-translations preserve the semantics of the original sentence? For example, consider the following English input passed multiple times into and out of French using the Google translator as of 27 December 2006:
Better a day earlier than a day late. ==> Améliorer un jour plus tôt qu'un jour tard. ==> To improve one day earlier than a day late. ==> Pour améliorer un jour plus tôt qu'un jour tard. ==> To improve one day earlier than a day late.
As noted above and in [1], this kind of round-trip translation is a very unreliable method of evaluation.
An interesting peculiarity of Google Translate as of 24 January 2008 (corrected as of 25 January 2008) is the following result when translating from English to Spanish, which shows an embedded joke in the English-Spanish dictionary which has some added poignancy given recent events:
Heath Ledger is dead ==> Tom Cruise está muerto
This raises the issue of trustworthiness when relying on a machine translation system embedded in a Life-critical system in which the translation system has input to a Safety Critical Decision Making process. Conjointly it raises the issue of whether in a given use the software of the machine translation system is safe from hackers.
It is not known whether this feature of Google Translate was the result of a joke/hack or perhaps an unintended consequence of the use of a method such as statistical machine translation. Reporters from CNET Networks asked Google for an explanation on January 24, 2008; Google said only that it was an "internal issue with Google Translate".[3] The mistranslation was the subject of much hilarity and speculation on the Internet.[4][5]
If it is an unintended consequence of the use of a method such as statistical machine translation, and not a joke/hack, then this event is a demonstration of a potential source of critical unreliability in the statistical machine translation method.
In human translations, in particular on the part of interpreters, selectivity on the part of the translator in performing a translation is often commented on when one of the two parties being served by the interpreter knows both languages.
This leads to the issue of whether a particular translation could be considered verifiable. In this case, a converging round-trip translation would be a kind of verification.